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Abstract 

We develop methods for parameter estimation in settings with large-scale data sets, 
where traditional methods are no longer tenable. Our methods rely on stochastic approx¬ 
imations, which are computationally efficient as they maintain one iterate as a parameter 
estimate, and successively update that iterate based on a single data point. When the 
update is based on a noisy gradient, the stochastic approximation is known as standard 
stochastic gradient descent, which has been fundamental in modern applications with large 
data sets. Additionally, our methods are numerically stable because they employ implicit 
updates of the iterates. Intuitively, an implicit update is a shrinked version of a stan¬ 
dard one, where the shrinkage factor depends on the observed Fisher information at the 
corresponding data point. This shrinkage prevents numerical divergence of the iterates, 
which can be caused either by excess noise or outliers. Our sgd package in R offers the 
most extensive and robust implementation of stochastic gradient descent methods. We 
demonstrate that sgd dominates alternative software in runtime for several estimation 
problems with massive data sets. Our applications include the wide class of generalized 
linear models as well as M-estimation for robust regression. 

Keywords: stochastic gradient descent, implicit updates, massive data, exponential family, 
generalized linear models, M-estimation. 


1. Introduction 

Massive data sets as well as streaming data, in which one observes only a group of data points 
at a time, are becoming increasingly common in modern statistical analysis. Under the setting 
of hundreds of millions of observations and hundreds or thousands of covariates (National 
Research Council 2013), it becomes difficult to estimate the parameters of a statistical model; 
the three ideal properties are computational efficiency, statistical optimality, and numerical 
stability, and it is challenging to address all three with a single estimation method. 
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More formally, suppose there exists a vector of parameters 0* G and that we observe 
i.i.d. samples D = {x„,y„}, for n = 1, 2,..., iV; in the data point (x„,y„), the out¬ 
come Yn G is distributed conditional on covariates x„ G MP according to a known den¬ 
sity /(yn;x„,0*), and thus the log-likelihood function for the entire data set D is given by 
£{9-, D) = J2n=i f{yn', Xn, 9). The task is to estimate the true parameter value 0* when N 
is infinite (streaming setting), or to approximate some estimator of 0 *, such as the maximum- 
likelihood estimator 0™*® = argmax^gjip £(0; D), when N is hnite. 

Widely used methods for statistical estimation, such as Fisher scoring, the EM algorithm, and 
iteratively reweighted least squares (Fisher 1925; Dempster, Laird, and Rubin 1977; Green 
1984) are not feasible in such settings; either they strictly do not apply in the streaming 
setting (infinite N), or they do not scale to large data (finite but large N). Fisher scoring, for 
example, requires at each iteration the inversion oi a p x p matrix and evaluation of the log- 
likelihood over the full data set D. This roughly yields 0{Np'^~^’') running time complexity, 
which is prohibitive when N and p are large. In contrast, estimation with massive data sets 
typically requires a running time complexity that is 0{Np^~'^), i.e., that is linear in N but 
sublinear in the parameter dimension p. 

Such performance is achieved in general by the stochastic gradient descent (sgd) algorithm, 
which was initially proposed by Sakrison (1965) as a modihcation of the Robbins-Monro 
procedure (Robbins and Monro 1951) for recursive estimation. It is defined through the 
iteration 

+ 7naVlog/(y„;x„,e-i). ( 1 ) 

We will refer to Equation 1 as SGD with explicit updates, or explicit SGD for short, because the 
next iterate 9^'^ can be computed immediately after the data point (x„,y„) is observed. 
The sequence 7 ^ > 0 is the learning rate sequence, and is typically defined such that —)• 
7 > 0 as n —)• 00 ; the hyperparameter 7 > 0 is fixed and known as the learning rate parameter. 
The sequence {Cn} is a sequence of positive-definite matrices, such that Cn ^ C with C 
known, and is used to better condition the iteration; in the simplest case Cn = 7, i.e., we 
simply use the identity matrix, which results in first-order explicit SGD. 

From a computational perspective, explicit SGD is efficient because it replaces the expensive 
inversion oi p x p matrices, as in Fisher scoring, by a scalar sequence 7 ^ > 0 and a matrix 
Cn that is fast to manipulate numerically, by design. Furthermore, the log-likelihood is 
evaluated at a single observation given x„, rather than the entire data set D, which saves 
significant computation time. From a theoretical perspective, explicit SGD is justified because 
the theory of stochastic approximations (Robbins and Monro 1951, Theorem 1) implies that 
9^'^ converges to a point 0oo such that E (V log /(yn; x„, 6 * 00 )) = 0. Under standard statistical 
theory, E (V log f(yn', Xn, 0*)) = 0, and this point is unique under typical regularity conditions 
(Lehmann and Casella 1998, Theorem 5.1, p.463), such as concavity of log-likelihood; this 
is true, for example, in the popular exponential family of statistical models (Brown 1986). 
Therefore, 0oo = he., explicit SGD converges to the true parameter value. In the finite 
N setting, a similar condition holds where 9^'^ approximates 0™^® if the data point in 
Equation 1 is an unbiased sample from the total N data points; see also Toulis and Airoldi 
(2015b) for a review of applications of SGD on modern machine learning applications. 

Despite these theoretical guarantees, explicit SGD requires careful tuning of the hyperparam- 
eter 7 in the learning rate: small values of the parameter make the iteration ( 1 ) very slow 
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to converge in practice, whereas large values can cause numerical divergence. Moreover, it is 
known that explicit SGD is statistically inefficient even when 7 is correctly specified (Toulis, 
Airoldi, and Rennie 2014). In particular, the amount of information loss from procedure (1) 
depends on the spectral gap of the Fisher information matrix, T{0) = —E (V^ log f{yn', 6 )), 
calculated at the true parameter value 0 = 0*. A large spectral gap makes it hard, or even im¬ 
possible, to make the learning rates large enough for fast convergence, and also small enough 
for stability (Toulis and Airoldi 2015a, Section 3.5). 

Motivated by these challenges, Toulis, Tran, and Airoldi (2015) introduced averaged implicit 
stochastic gradient descent (ai-SGd), which is dehned by the procedure 

0r = + JnCnV log f{yn, Xn, 0'^), (2) 

n 

9„ = (l/n)^9”. (3) 

i=l 

The first key component of Ai-SGD is the implicit update (2). Note that it is implicit because 
the next iterate 0™ appears on both sides of the equation. This simple modihcation of the 
explicit SGD procedure offers several statistical advantages. In particular, assuming a common 
starting point = 0™ i = 0o, one can show through a Taylor approximation of (2) around 
00 that the implicit update satisfies 

A0r = {I + lnCn±{Bo\ Xn, Yn))”'A0^gd + (4) 

where A0„ = On — Bn-i for both methods, I is the identity matrix, and X(0o;x„,y„) = 
—V^.^(0o; Xn, Yn) is the observed Fisher information matrix at 0o (equivalent to the Hessian of 
the negative log-likelihood at 0o). Equation 4 implies that the implicit update (2) is a shrinked 
version of the explicit update (1). This shrinkage makes the iterations significantly more stable 
in small-to-moderate samples, and also robust to misspecifications of the learning rate param¬ 
eter 7 (Toulis et al. 2014). The implicit update (2) also has a Bayesian interpretation, where 
0 ™ is the posterior mode of a model with the standard multivariate normal AA(0™ ynCn) 
as the prior, and /(0;xn,yn) as the likelihood. Thus it provides an iterative form of regu¬ 
larization. In optimization, update (2) is known as a proximal update, and corresponds to a 
stochastic version of the proximal point algorithm (Rockafellar 1976). Krakowski, Mahony, 
Williamson, and Warmuth (2007) and Nemirovski, Juditsky, Lan, and Shapiro (2009) have 
shown that proximal methods fit better in the geometry of the parameter space. 

The second key component of Ai-SGD is iterate averaging (3), which guarantees optimal sta¬ 
tistical efficiency under fairly relaxed conditions. Ruppert (1988) and Polyak and Juditsky 
(1992) first proved that averaging of iterates can achieve statistical optimality in the stan¬ 
dard context of stochastic approximation with explicit updates; Toulis et al. (2015) extended 
this result to the implicit SGD update (2). Thus, Ai-SGD is effectively a recursive estimation 
method that is both statistically optimal and numerically stable, while remaining applicable 
to the setting of massive and/or streaming data. 

In this paper we develop statistically efficient SGD algorithms for generalized linear models— 
extending Algorithm 1 of Toulis et al. (2014)— and also develop SGD algorithms to perform 
high-dimensional M-estimation. This allows for scalable estimation of such models with mas¬ 
sive and/or streaming data. We provide a publicly available package sgd (Tran, Toulis, and 
Airoldi 2015) written in R, which implements Ai-SGD, as well as other SGD variants. In Section 
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2, we develop the algorithms. Section 3 contains experiments on simulated and real-world 
data, in which we demonstrate the advantages of the sgd package compared to alternative 
software. In Section 4, we describe the interface of sgd and implementation details for its use 
in practice. 


2. Algorithms 

In this section we develop algorithms which implement implicit SGD and Ai-SGD for gener¬ 
alized linear models as well as M-estimation. We start by introducing an algorithm which 
efficiently computes a generalization of implicit update ( 2 ), which is useful for the aforemen¬ 
tioned applications. 

2.1. Efficient computation of implicit updates 

The main difficulty in applying Ai-SGD is the solution of the multidimensional fixed point 
equation for the implicit update (2). In the large class of models where the likelihood given 
covariate x depends on the parameter 0 only through the natural parameter rj = 6, the 

solution of the fixed-point equation is computationally efficient. The general result is given 
in Theorem 2.1, whereas the assumption is made more precise below. 

Assumption 2.1. The likelihood i{9;xn,yn) = log/(yn;x„, 0) of parameter value 6 given 
data point (x„,y„) depends on 9 only through the produet xj0, i.e., 

£( 6 »;x„,y„) = £(x^ 6 »;x„,y„). (5) 

A key implication of Assumption 2.1 is that the direction of the gradient of the log-likelihood 
does not depend on the parameter value since V log/(y„; x^, 0) = £'(xj0; x^, y„)xn, where 
the latter derivative is with respect to the natural parameter xj 0 and with fixed data iCn,Yn- 
This property is crucial because it implies that the implicit update (2) can be performed once 
a scalar value is found that will appropriately scale the gradient. 

Theorem 2.1. Suppose Assumption 2.1 holds. Then the gradient for the implicit iterate 
(2) is a scaled version of the gradient at the previous iterate, i.e., 



Vlog/(y„;Xn, 6 '™) = s„V log/(y^; x„, 0 ™ i). 

( 6 ) 

The scalar Sn 

E M satisfies 



^n^n—l — ^ (^n ^n—1 ; 

(7) 

where Kn-i = 

f(xj 0 ™i;x„,y„). 



Theorem 2.1 shows that the gradient V log/(y„,; x„, 0™) in the implicit update (2) is in fact 
a scaled version of the gradient V log/(yn,; x^, 0™ i) that would appear in update ( 2 ) if we 
were applying explicit updates. Therefore, computing the implicit update reduces to finding 
the scale factor Sn £ K. See Toulis and Airoldi (2015a, Threorem 4.1) for a proof. 

Penalized likelihood. It is possible to regularize both explicit and implicit SGD by adding 
a penalty to the log-likelihood. In particular, we consider the elastic net (Zou and Hastie 
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2005), where for some fixed a € [0,1] the penalty function is 

p^{e) = {i-a)\\\e\\l + a\\eh. ( 8 ) 

Adding the elastic net with a regularization parameter A G M to explicit SGD is straightfor¬ 
ward: 


0^®" = + 7nC'„(Vlog/(y„; x„, - AVP„(C"_",)), (9) 

where the gradient of the elastic net penalty is given by 

VP„(0n-i) = (1 - a)C-i + asign(0^®_)^^). (10) 

Here, the operation sign(0) is the element-wise sign operation, outputting 1 if 6j > 0, — 1 if 
Oj < 0, and 0 otherwise. 

For implicit SGD the update would be 

= 6'n-i +7nC'n(Vlog/(y„;x„,6'7) - AVP„(6'r))- (H) 

However, it is not generally possible to compute update (11). For example. Assumption 
2.1 does not hold because the gradient of the log-likelihood and the gradient of the penalty 
generally have two different directions. This breaks the argument of Theorem 2.1, where the 
direction of the update calculated at the next iterate 0™ is the same as the direction of the 
update calculated at the previous iterate 

To circumvent this problem, we simply penalize the previous iterate instead of the current, 
i.e., perform the update 

C = OT -1 + 7 na(Vlog/(y„;x„,C) - AVP„(0 “i)). (12) 

Then update (12) is equivalent to 

6»r = 0™! -k7nC'n(snVlog/(y„;x„,6'™i) - AVPa(0n-l)), (13) 

where the scale factor Sn satisfies 

1 — f i^^n^n—1 Paifin—l) PjnSn^^n—l^n^'nJ^ni^niyr^ i (I”!) 

and where Kn-i = x„, y^). A proof for this case with penalized likelihoods is 

identical to the proof of Theorem 2.1. 

Final algorithm for implicit updates. This analysis leads to Algorithm 1, which, for 
models satisfying Assumption 2.1, implements the most general update (13) of implicit SGD 
with conditioning matrices and penalty. This algorithm applies a root-finding procedure 
solving Equation 14 at every iteration, which is fast because the equation is one-dimensional 
and the search bounds for the solution are known, having a diminishing range 0{'yn)- Indeed, 
the one-dimensional search is computationally negligible in practice, as we see in Section 
3. 

We also note that because the implicit update (17) effectively does regularization as a shrink¬ 
age estimate (see Equation 4), the use of penalization is not as crucial in practice as it is for 
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Algorithm 1 Efficient implementation of implicit update (13) 

1: function IMPLICIT_UPDATE(f'(-; •), 7n, 6*™ 1, X„, Yn, Cn, Pa) 

2: # Compute search bounds B 

3: Tn ^ 7n^' (xj0™ i; X„, y„) 

4: B -(r- [0, r„] 

5: if Tn < 0 then 

6 : B [rmO] 

7: end if 

8: # Solve fixed-point equation by a root-finding method 

9 : ? = 6*^-1 - x ^, y ^), 

10: Sn i ^I'^n 

11: # Equivalent to implicit update (13) 

12; return 0™ ^ + 7 „C'n (s„£' (x^™ E x„, yn) - XVPa{9^ff_i)) 

13; end function 


explicit updates. We make extensive experiments using Algorithm 2 and also examine this 
effect in Section 3. 


2.2. Generalized linear models 


In the family of generalized linear models (GLMs), the outcome follows an exponential 

family distribution conditional on x„, 

1 

where the scalar ?/; > 0 is the dispersion parameter which affects the variance of the outcome, 
c(-, •) is the base measure, and 6(-) is the log normalizer which ensures that the distribution 
integrates to one.^ Additionally, in a GLM it is assumed that E(y„|x„) = /i(x7*)) where 
/i : M —)• M is known as the transfer function (Nelder and Wedderburn 1972; Dobson and 
Barnett 2008). A simple property of GLMs is that the transfer function is the first derivative 
of the log normalizer, i.e., /i(x7) = b'{'x^0), for all x„,0. 

A straightforward implementation of explicit SGD for estimation with GLMs is 

= ^n-l + InCnlyn “ /l(x7^7)]Xn- (16) 


{rtuVn - b{r]n)) > c{ynA), 


Vn = x7* 


(15) 


y-n 


exp 


Similarly, the Ai-SGD procedure can be written as 

6»r = 9^-1 + lnCn[yn “ /i(x7r)]Xn 
1 

2 = 1 

By assumption, f(0;y„,x„) oc {'x^9^)yn — 6(x7*)) and thus the log-likelihood depends on 
parameter value 0* only through its linear combination with covariate value x„. Additionally, 

^We present one-dimensional outcomes for simplicity. However, our theory easily extends to multidimen¬ 
sional outcomes. Such an extension is given, for example, in Section 2.3 on M-estimation. 
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Algorithm 2 Estimation of generalized linear models with Ai-SGD 

1 : Initialize 00 

2: for n = 1, 2,... do 

3: Define £'(xX0;x„,y„) = - /i(xj0) 

4: Calculate implicit update 

0™ ^ IMPLICIT_UPDATE(f'(•;•),7n,0j^-l,x„,yn,C„,Pa) 

5: 0n ^ ^0'n-l + ^07 

6 : end for 


Var(y„|x„) = /i'(x^0*)||x„|p, and thus h' > 0, which implies that f is twice-differentiable 
and concave, thus fulfilling Assumption 2.1. 

Penalized likelihood. As argued before, one can add the elastic penalty by applying it to 
the previous estimate instead of the current. That is, for fixed a £ [0,1] and regularization 
parameter A £ M, the Ai-SGD procedure for generalized linear models with elastic net is 


0r = 0T-1 + InCn ([Vn “ /l(x^0r)]Xn “ AVP„(0™ i)) , 



2=1 


(18) 


Algorithm 2 implements estimation of GLMs through Ai-SGD based on updates (18). 


2.3. M-Estimation 

Given a data set of N observations T) = {(x„,y„)} and a convex function p : M —>■ M"*", the 
M-estimator is defined as 


N 


Qvn 


arg mm ^ p(y^ 

n=l 



(19) 


where it is assumed y„ = x^0*+e,i, and e„ are i.i.d. zero mean-valued noise. M-estimators are 
especially useful in robust statistics (Huber 1964; Huber and Ronchetti 2009), as appropriate 
choice of p can reduce the influence of outliers in data. Recently, there has been increased in¬ 
terest in the literature for fast approximation of M-estimators due to their robustness (Donoho 
and Montanari 2013; Jain, Tewari, and Kar 2014). 

Typically in M-estimation, p is twice-differentiable around zero and 

E(p'(y„-xT0™)x,) =0, (20) 

where the expectation is over the empirical data distribution. Therefore SGD algorithms can 
be applied to approximate the M-estimator 0™. Importantly, p is convex, which implies that 
the conditions of Assumption 2.1 are met. 
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Algorithm 3 M-estimation with Ai-SGD 

1 : Initialize 00 

2: for n = 1, 2,... do 

3: Define i'{x^6;xn,yn) = -p'iYn - xj0) 

4: Calculate implicit update 

0™ ^ IMPLICIT_UPDATE(f'(•;•),7n,0j^-l,x„,yn,C„,Pa) 

5: 0n ^ ^0'n-l + ^07 

6 : end for 


The AI-SGD procedure for approximating M-estimators is 

0r = OT-l + lnCn[p'iyn " ^l0T)]^n, (21) 

= ( 22 ) 

i=l 

An outline of the procedure is given in Algorithm 3. As before, Algorithm 3 also includes the 
optional use of a sequence of conditioning matrices Cn and a penalty function Pa. The use 
of penalization has particularly been considered as a way to merge the robustness properties 
given by a choice of p with sparsity, e.g, through lasso (Owen 2007; Lambert-Lacroix, Zwald 
et al. 2011; Li, Peng, and Zhu 2011). 

It is also typical to assume that the density of Cn is symmetric around zero. Therefore, it 
also holds E {p'{yn — x)[0*)x„) = 0, where the expectation is over the true data distribution. 
Hence SGD procedures can be used to estimate 0* in the case of an infinite stream of obser¬ 
vations (N = oo). We write Algorithm 3 for the case of finite N, but it is trivial to adapt the 
procedure to infinite N. 


3. Experiments 

In this section, we compare the SGD methods implemented in the sgd package, such as ex¬ 
plicit SGD and AI-SGD, with standard, deterministic optimization methods that are widely 
used in statistical practice, such as glmnet, biglm, and speedglm. We demonstrate in both 
massive and streaming data settings that standard methods are not applicable, and further¬ 
more that SGD methods outperform such methods upon orders of magnitude in runtime and 
convergence. 

As standard methods are not competitive, we also compare the proposed SGD methods to 
each other, e.g., comparing Ai-SGD to explicit SGD, across a wide range of learning rate 
specifications, including adaptive specifications such as AdaGrad (Duchi, Hazan, and Singer 
2011) and RMSProp (Tieleman and Hinton 2012); more details on the specifications which 
are available in sgd are given in Section 4.3. 

All timings are carried out on a general-purpose 2.6 GHz Intel Core i5 processor, and are 
reported for various algorithms which reach a thresholded L 2 distance to the true parameter 
value. 
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3.1. Linear regression with the lasso 

We follow an experiment used in benchmarking the glmnet package (Friedman, Hastie, and 
Tibshirani 2010, Section 5.1), which fits GLMs with the elastic net penalty over a regulariza¬ 
tion path. As glmnet was shown to outperform related software such as elasticnet (Zou and 
Hastie 2012) and lars (Hastie and Efron 2013), we compare sgd strictly to glmnet. The design 
matrix X with N observations and p predictors is generated from a normal distribution such 
that each pair of predictors Xj,Xj/ has the same correlation p. Each of the N outcomes y^, 
n = 1, 2,... A^, is defined as 

yn = :x-n9^, + ken, (23) 

where 0*^ = (—1)-^ exp(—2(j — l)/20) so that the elements of the true parameter value 0* have 
alternating signs and are exponentially decreasing. The noise is distributed as a standard 
normal, e ~ AA(0,1), and k is chosen so that the signal-to-noise ratio is equal to 3.0. We run 
glmnet with “covariance updates”, which takes advantage of sparse updates in the parameter 
space to reduce the complexity of 0{Np) calculations per iteration. It performs better in our 
experiments than the “naive update” also considered in Eriedman et al. (2010). 

Table 1 outlines results for a combination of triplets {N,p,p), ranging from N = 1,000 ob¬ 
servations and scaling up to = 10 million, glmnet is seen to be competitive with SGD 
procedures under the setting of A" = 1,000 observations, and in fact glmnet slightly outper¬ 
forms SGD algorithms for lower dimensions of N and p. It is in any higher dimensional setting 
where sgd strictly dominates glmnet, as seen in the table where for example, with N = 50,000 
and p = 10,000, sgd is orders of magnitude faster. 

Furthermore, glmnet is restricted by the memory limitations of computer hardware. For 
example, simulations with 100,000 observations and 10,000 features require 8 GB in memory 
for simply storing the data, and more is required for parameter storage and computational 
overhead. For the sgd package, we simply stream the data points using bigmemory, which 
requires less than 500 MB of RAM for all our experiments, a 16-fold decrease in memory 
requirements. This is not possible for glmnet in either the case of real streaming data, or 
simply as a way to remove memory bottlenecks. In principle, gradient descent algorithms 
such as glmnet can read and destroy data memory from disk as it loops over the full data 
set; however, this is impractical as it requires such an expensive memory access at each 
iteration. 

We now compare the SGD algorithms. For small dimensional problems, explicit SGD achieves 
faster runtime than Ai-SGD as it does not require a one-dimensional search following Algorithm 
1. However, in high dimensions and high correlations, it becomes extremely difficult for 
explicit SGD to even converge for this toy linear model. It is sensitive to the learning rate, and 
any misspecification can cause it to diverge numerically. Thus, we were not able to obtain a 
proper timing for explicit SGD in settings of either high correlation {p > 0.9) or high dimension 
with medium correlation {p > 0.5). In practice one must tune the hyperparameter for explicit 
SGD — thus requiring significant computational overhead and user input—while also closely 
monitoring the stochastic gradients for consideration of other numerical issues. Ai-SGD on the 
other hand uses additional computation per iteration, which in high dimensions is negligible 
compared to the cost of a stochastic gradient update. This additional computation leads to 
significantly more robust updates and faster convergence. 
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Correlation 


0 

0.1 

0.2 0.5 0.9 

0.95 


N = 

1, 000 p = 100 (sec) 



sgd (niethod=" ai-sgd" 

) 

0.03 

0.03 

0.03 

0.03 

0.04 

0.34 

sgd(method="sgd") 


0.02 

0.02 

0.02 

0.02 

0.03 

0.03 

glmnet 


0.02 

0.02 

0.02 

0.02 

0.02 

0.03 




iV = 1C 

I, 000 p 

= 1, 000 (sec) 


sgd(method="ai-sgd" 

) 

1.81 

1.65 

1.78 

1.50 

1.85 

1.83 

sgd(method="sgd") 


2.78 

2.90 

2.93 

2.81 


- 

glmnet 


6.60 

7.76 

8.00 

7.83 

6.50 

6.70 



N = 50, 

000 p = 

= 10,00C 

) (min) 


sgd(method="ai-sgd" 

) 

3.12 

3.51 

3.43 

3.26 

3.40 

3.38 

sgd(method="sgd") 


4.83 

4.86 

5.23 

- 

- 

- 

glmnet 


14.58 

15.28 

16.29 

15.58 

16.54 

16.41 



N 

= 1,000,000 p 

= 50,000 (min 

) 

sgd(method="ai-sgd" 

) 

22.23 

21.10 

19.88 

21.52 

18.53 

20.53 

sgd(method="sgd") 


27.80 

34.08 

- 

- 

- 

- 

glmnet 


- 


- 

- 

- 

- 



N 

= 10,000,000 p = 100,000 (hr 

) 

sgd(method="ai-sgd" 

) 

9.38 

10.20 

9.58 

8.54 

10.11 

10.74 

sgd(method="sgd") 


13.50 

- 

- 

- 

- 

- 

glmnet 


- 

- 

- 

- 

- 

- 


Table 1: Linear regression with the lasso. Timing (in various units) is displayed for 100 
A values, averaged over 10 runs. The first line is sgd using Ai-SGD and the second line is 
sgd using explicit SGD. Omitted entries indicate failure of the algorithm; for explicit SGD it 
numerically diverges, and for glmnet it could not run due to memory limitations. 


3.2. Logistic regression with ridge penalty 

Following benchmarks that are popular in the machine learning and optimization literature 
(Xu 2011; Shamir and Zhang 2012; Bach and Moulines 2013; Schmidt, Le Roux, and Bach 
2013), we perform large-scale logistic regression on four data sets: 

• rcvl (Lewis, Yang, Rose, and Li 2004): text data set in which the task is to classify 
documents belonging to class ccat, where we apply preprocessing provided by Bottou 
( 2012 ). 

• covtype (Blackard 1998): data set consisting of forest cover types in which the task is 
to classify for one specific class among 7 forest cover types. 
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description 

type 

covariates 

training set 

test set 

A 

covtype 

forest cover type 

sparse 

54 

464,809 

116,203 

10-6 

delta 

synthetic data 

dense 

500 

450,000 

50,000 

10-2 

rcvl 

text data 

sparse 

47,152 

781,265 

23,149 

10-6 

mnist 

digit image features 

dense 

784 

60,000 

10,000 

10-6 


Table 2: Summary of data sets and the L 2 regularization parameter A used. 


• delta (Sonnenburg, Franc, Yom-Tov, and Sebag 2008): synthetic data offered in the 
PASCAL Large Scale Challenge. We apply the default processing offered by the chal¬ 
lenge organizers. 

• mnist (Le Cun, Bottou, Bengio, and Haffner 1998): images of handwritten digits, where 
the task is to classify digit 9 against all others. 

A summary of the data sets is available in Table 3, where the number of observations are 
typically on the order of several hundred thousand, and the covariates range from a few dozen 
to tens of thousands. The regularization parameter A for the ridge penalty are set according 
to those used in Xu (2011). 

We compare to the following three packages: biglm (Lumley 2013) and speedglm (Enea, Meiri, 
and Kalimi 2015), both of which perform approximate updates using iteratively reweighted 
least squares, and LiblineaR (Helleputte 2015), which is a simple wrapper to a C-|--|- library 
for regularized linear classification. We use the stochastic dual coordinate ascent algorithm 
(Shalev-Shwartz and Zhang 2013) in LiblineaR. In addition, we consider the mnlogit package 
(Hasan, Zhiyu, and Mahani 2015), which implements multinomial logistic regression using the 
classical technique of Newton-Raphson, and exploits iterations over intermediate data struc¬ 
tures for fast Hessian calculations. For modest-sized problems, mnlogit is shown to be 10-50 
times faster than mlogit (Croissant 2013), VGAM (Yee 2010), and the multinom function in 
nnet (Venables and Ripley 2002). Finally, we also run the default function glm.f it as a base¬ 
line. We note that mnlogit and glm.fit can be only employed for standard (unregularized) 
multinomial regression, so we run them without the ridge penalty. 

Table 3 outlines the runtimes for the considered packages. The two SGD algorithms are orders 
of magnitude faster than its competitors on all data sets. Interestingly, biglm and speedglm 
failed to run on the three real data sets when attempting to invert subsets of the data, and 
only succeeded for the one synthetic data set delta. We also note that the largest data 
set—rcvl—failed for the majority of algorithms: only the packages sgd and LiblineaR were 
able to converge, both of which natively use stochastic gradients for computationally efficient 
updates. However, sgd is significantly faster because less overhead seems to be involved in 
passing data structures to perform computation in native C-|—1-. 

Moreover, sgd requires 0{p) memory, which is optimal in the sense that 0{p) is the minimum 
required for simply storing the iterate 0™. Both biglm and speedglm require 0{p^) for the 
inversion oiapxp matrix, as do mnlogit and glm.fit. The mnlogit package also requires data 
in the long format, which leads to a duplication of rows, as many entries display redundant 
information. Moreover, while exploitation of the Hessian structure can help in practice (as it 
outperforms glm.fit), we observe that the traditional technique of Newton-Raphson remains 
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data set 

sgd (ai-SGd) 

sgd (sgd) 

biglm 

speedglm 

LiblineaR 

mnlogit 

glm.fit 

covtype 

5.21 

7.58 

- 

- 

1444.78 

16.04 

40.11 

delta 

10.10 

10.23 

736.13 

30.50 

2167.14 

445.73 

498.97 

rcvl 

14.15 

15.42 

- 

~ 

133.10 

- 

- 

mnist 

3.50 

3.37 

~ 

- 

208.55 

232.53 

890.76 


Table 3: Large-scale logistic regression on four data sets. Timing (in seconds) is displayed, 
averaged over 10 runs. Omitted entries indicate failure of the algorithm; for biglm and 
speedglm, it could not run due to inversions of singular matrices; for mulogit it could not run 
due to memory limitations. 


covtype test error 


delta test error 


0.30 


0.28- 
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Figure 1: Large scale logistic regression on four data sets. Each plot indicates the classifica¬ 
tion error on the test set for explicit SGD with AdaGrad, Ai-SGD, averaged SGD, and explicit 
SGD over a pass of the data. 
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N 

p 

sgd (ai-sgd) 

sgd (sgd) 

hqreg 

units 

1,000 

100 

0.05 

0.04 

0.03 

(sec) 

10,000 

500 

0.55 

0.46 

0.40 

(sec) 

10,000 

1,000 

1.30 

2.22 

6.34 

(sec) 

50,000 

10,000 

3.12 

3.86 

15.57 

(min) 

100,000 

50,000 

8.13 

15.20 

- 

(min) 

1,000,000 

100,000 

35.88 

51.93 

- 

(min) 

10,000,000 

100,000 

8.64 

9.55 

- 

(hr) 

100,000 

1,000,000 

18.80 

26.43 

- 

(hr) 


Table 4: High-dimensional M-estimation with the Huber loss. Timing (in units given by the 
last column) is displayed for 100 A values, averaged over 10 runs. Omitted entries indicate 
failure of the algorithm; for hqreg, it could not run due to memory limitations. 


untenable because it still requires 0{Np^) complexity per iteration in the worst case. 

For demonstration, Figure 1 shows the progress of multiple SGD algorithms available in sgd (see 
Section 4.2) over a pass of the data. We note that Ai-SGD achieves the fastest or competitive 
convergence rates, without requiring significant tuning of parameters as the other algorithms 
do; this includes popular adaptive learning rate specifications, such as explicit SGD with 
AdaGrad. 

3.3. M-estimation with the Huber loss 

We follow an example for high-dimensional M-estimation in Donoho and Montanari (2013, 
Section 2.4). Define the convex function p : M —)■ M"*" to be the Huber loss, 

|^A|z| — A^/2, otherwise. 

Fix the thresholding parameter A = 3, and generate the N xp design matrix with i.i.d. entries 
X,,, ~AA(0,^). We fix the true set of parameters 0* to be a vector randomly drawn with 
fixed norm || 0*||2 = ^x/p, and then generate outcome y^, n = 1, 2,..., A^, as 

yn = xl0^ + en. (24) 

For the distribution of errors e^, we use Huber’s contaminated normal distribution CN(0.05,10), 
i.e., Cn ~ 0.952;-|-0.05/iio, i.i.d., where z is standard normal and hx is a point mass at x. 

Few alternative packages to sgd exist for high-dimensional robust estimation. We compare to 
hqreg (Yi 2015), which fits regularization paths for Huber loss regression with the elastic net 
penalty. Note that hqreg is specialized to the Huber loss and cannot perform estimation for 
the general setting of M-estimation problems considered here. 

Table 4 outlines results for a combination of pairs {N,p), ranging from small problems of 
N = 1,000 observations to massive data settings of = 10 million. We apply the elastic 
net penalty with a = 0.5, which puts even weight on both the lasso and ridge components, 
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Figure 2: High dimensional M-estimation with the Huber loss, for N = 100,000 observations, 
p = 10,000 covariates, and a fixed regularization parameter A. The plots indicate the mean- 
squared error across iterations (left) and time (right) for SGD algorithms. The horizontal line 
displays the mean-squared error for the exact M-estimator 0™. 


and then compute a regularization path for both packages. We also include an example of 
N = 100,000 observations and p = 1, 000, 000 covariates, where there exist far more covariates 
than data points; this occurs often in applications, e.g., in text analysis, bioinformatics, and 
signal processing (Lustig, Donoho, Santos, and Pauly 2008; Blei 2012). 


The SGD algorithms begin to outperform hqreg on the order of tens of thousands of obser¬ 
vations, and significantly so for larger data settings. Similar to the memory limitations of 
glmnet, hqreg requires access to the full data set per iteration of its algorithm, which is in¬ 
feasible when the data cannot be held in memory. Thus we were unable to obtain proper 
timings for data sets of size greater than 50, 000 observations and 10,000 covariates. 

Figure 2 displays the progress of the SGD algorithms for the setting of iV = 100,000 observa¬ 
tions and p = 10, 000 covariates, for a hxed regularization parameter A. For demonstration, 
we run the algorithms over 10 passes of the data and thus over a total of 1 million iterations. 
Ai-SGD is seen to achieve a significantly faster convergence rate than explicit SGD. We also 
consider the use of adaptive schedules, here with RMSProp, as it performs the fastest among 
other available learning rates (see Section 4.3). With RMSProp, the difference between the 
two methods— SGD and Ai-SGD — is noticeably smaller, and in fact SGD seems to converge 
slightly faster. We note however that the use of SGD algorithms with RMSProp breaks sta¬ 
tistical efficiency, and indeed we see this effect as the mean-squared error oscillates around a 
value higher than the MSE of the exact M-estimator (green line). Therefore we advocate the 
use of AI-SGD with a one-dimensional learning rate, which still converges quite quickly. 
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4. Interface and implementation 

We now discuss the interface of sgd and various technical details that are important for its 
use in practice. 

4.1. Interface 

The sgd package provides an intuitive and accessible set of methods for performing estimation 
with large-scale data sets. At the core of the package is the function 

sgd(formula, data, model, model.control, sgd.control) 

The user provides a formula on the data frame data —similar to function primitives, such as 
Im —and then specifies the model. The model parameters are estimated using SGD methods, 
which defaults to AI-SGD. The optional arguments model. control and sgd. control specify 
attributes one can tweak about the model and the stochastic gradient method, respectively. 
For example, given a data frame dat with response vector stored as the column y, 

sgd(y ~ ., data=dat, model="lm") 

fits a linear model with the default specifications, e.g., Ai-SGD with a one-dimensional learning 
rate. Similarly, 

sgd(y ~ ., data=dat, model="glm", model.control=list(family="binomial")) 

fits logistic regression with the default specifications. Numerous examples are available in the 
package by running demo(package="sgd"). 

The sgd function also interfaces with data sets that are too large to fit into memory or 
are streaming (more details in Section 4.4), and can be run with a custom loss function if 
desired. 

The output of the sgd function is a sgd object, which is a light wrapper on a list which 
collects quantities, such as the final parameter estimates and convergence diagnostics. Custom 
generic methods are also available for the sgd class, such as print, predict, and plot. 

4.2. Stochastic gradient methods 

While we describe the explicit SGD and Ai-SGD algorithms in Section 2, the following stochastic 
gradient methods are also implemented in sgd: 

• implicit SGD: Proposed by Toulis et al. (2014) in the context of generalized linear models, 
this algorithm uses the implicit update (2) and does not do any iterate averaging. 

• averaged SGD: Proposed by Ruppert (1988) and Bather (1989) independently, this 
algorithm uses the explicit update (1) followed by iterate averaging (3). 

• classical momentum (cm): Proposed by Polyak (1964), this algorithm uses the update 

Vn = fJ-Vn-i -ba„Vlog/(yn;x„,6'„_i), (25) 

en = 0n-l+Vn, (26) 
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where /x G [0,1] is a fixed momentum coefficient. CM accelerates gradient descent with 
a velocity vector which accumulates directions of large increase in the log-likelihood. 

• Nesterov’s accelerated gradient (nag): Proposed by Nesterov (1983), this algorithm 
uses the update 


Vn = lJ-Vn-1 -Fa„Vlog/(yn;Xrt,6'„_l + gLVn-l), (27) 

Gn = Gn-1 + Vn, (28) 

where g G [0,1] is a fixed momentum coefficient. NAG is similar to CM but accumulates 
velocity at a ’’look-ahead” point 6n-i + gVn-i- This makes a partial update closer to 
On, allowing NAG to change its velocity more quickly and responsively. 

While all these methods are available, we recommend and apply Ai-SGD as the default. It can 
be seen as an effective combination of the advantages from both implicit SGD and averaged 
SGD (Toulis et al. 2015). The momentum-based methods CM and NAG enjoy faster convergence 
rates than the original explicit SGD, but offer no theoretical benefits against Ai-SGD. Without 
averaging techniques they also are statistically inefficient, whereas iterate averaging can be 
interpreted as an acceleration technique because larger learning rates are used. The velocity 
update in NAG is also a proxy for the implicit update, as its benefit mostly relies on making 
updates close to where the new estimate would lie. 

4.3. Learning rates 

We describe the available learning rates in more detail because they are critical for convergence 
of SGD methods, in practice. It is well-known (Sakrison 1965; Amari 1998; Toulis et al. 
2014) that explicit SGD (1) and implicit SGD (2) have optimal statistical efficiency if the 
learning rate sequence 7 „ together, with the conditioning matrices C„, approximate the inverse 
Fisher information matrix X(0*) = —E (V^^(0*; x„, y^)), i.e., ^nCn —t X(0*)“^, in the limit. 
Therefore in hrst-order methods where Cn = I, the learning rate sequence acts as a scalar¬ 
valued approximation to the optimal rescaling as it is used in Fisher scoring (Fisher 1925). 
Based on this theory, the following learning rates are implemented in sgd: 

• One-dimensional (Xu 2011): The learning rate is of the form 

In = 7o(l +070^)“'", 

where jo,a,c G M are fixed constants. For SGD algorithms without iterate averaging 
and SGD algorithms with iterate averaging, Xu (2011) proved that setting c = 1 and 
c = 2/3, respectively, leads to optimal statistical efficiency; a similar result holds for 
AI-SGD (Toulis et al. 2015). 

• AdaGrad (Duchi et al. 2011): Rather than specify a one-dimensional learning rate 
7n G M, Duchi et al. (2011) propose a diagonal conditioning matrix Cn G given by 

Xn — In—1 T diag( V £(^6n—l, X^,, yn')^^{On—l, X^, yn) ) j 
a = r?(X„ + e/)-i/2^ 

where diag(-) extracts the diagonal entries of its matrix argument, r/ G M is a constant, I 
is the identity matrix, and e is a fixed value, typically 10“®, to prevent division by zero. 
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In the limit, In is an unbiased estimate of the diagonal entries of the Fisher information, 
and the proposed diagonal matrix Cn, which accumulates such curvature information, 
is proven to be optimal for minimization of the regret bound. 

• RMSProp (Tieleman and Hinton 2012): A learning rate which is popular in the deep 
learning literature (Srivastava, Hinton, Krizhevsky, Sutskever, and Salakhutdinov 2014; 
Ranganath, Tang, Charlin, and Blei 2015; Rezende and Mohamed 2015), Tieleman and 
Hinton (2012) propose the diagonal conditioning matrix Cn G given by 

In = pin-i + (1 - /3)diag(V£(6»„_i;Xn,y„)V£(6'n_i;x„,y„)’^), 

Cn = r?(Xn + 

where /3 G [0,1] is the discount factor, r? G M is a constant, I is the identity matrix, 
and e is a fixed value to prevent division by zero, as in AdaGrad. RMSProp uses a 
decay in the estimate for the Fisher information by taking a weighted average, and 
thus it gives more weight onto newer than older information. RMSProp aims to offset 
one problem AdaGrad often encounters in practice, where very large values occur for 
initial estimates of In (e.g., due to poor initialization), thus slowing down the AdaGrad 
procedure as it tries to accumulate enough curvature information to compensate for such 
an error (Schaul, Antonoglou, and Silver 2014). RMSProp balances this by taking a 
weighted average of previous and new information, and sees much empirical success. One 
problem, however, is that RMSProp is no longer decaying sufficiently quickly (Robbins 
and Monro 1951; Duchi et al. 2011), and thus it has no guarantees on convergence. 
Moreover, assuming convergence, the limit of the learning rate sequence is a constant, 
which makes the iterates jitter around the true parameter value, ad infinitum. 

• Fisher: Following results on statistical efficiency and Fisher scoring, we propose a learn¬ 
ing rate using a diagonal conditioning matrix Cn G given by 

In — (1 1 T qndiag(V£(0fi,_l, X, 2 ) Yn)1) ^71) yn) ); 

Cn = {In + eI)-\ 

where 7 „ oc 1/n, and e is a small fixed value to prevent division by zero, as in AdaGrad. 
As before. In in the limit is an unbiased estimate of the diagonal Fisher information, 
and Cn is adaptive to curvature information. 

One critical but often unnoticed issue with AdaGrad, RMSProp, and similar adaptive sched¬ 
ules is that they are statistically inefficient: the specification of the learning rates leads to 
biased estimation of the inverse Fisher information matrix X(0*)“^ that, as mentioned ear¬ 
lier, is necessary for optimal statistical efficiency (an important exception is iterate averaging). 
This leads to a suboptimal asymptotic variance for the SGD procedure. Thus we recommend 
and apply the last proposed learning rate (“Fisher”) by default: it takes advantage of the 
curvature information such methods benefit from, while still preserving as much statistical 
efficiency as possible in diagonal conditioning matrices. 

4.4. Software integration 

For data sets that cannot be loaded into memory, we access subsets of the data using big- 
memory (Kane, Emerson, and Weston 2013). This allows one to perform stochastic gradient 
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descent by passing over the data loaded into RAM, and then to reload a new data set. This 
naturally applies to both large data sets, e.g., on the order of dozens of gigabytes, and stream¬ 
ing settings, in which one has access only to a subset of the (potentially infinite) data at a 
time. 

In principle, with bigmemory the memory requirement for these stochastic gradient methods 
is only a single data point and the current parameter estimate, which is the minimum 0{p) 
complexity for simply storing the estimate. In our implementation we use these savings to 
try to load as much data into RAM as possible. This speeds up convergence in practice, as 
it reduces the amount of I/O overhead; this especially becomes a significant bottleneck when 
reading many objects from disk. 

For fast implementations we use Repp (Eddelbuettel and Frangois 2011), where all algo¬ 
rithms are written in C-|—|- and only interface-level code is written in R. Aside from the major 
computational gains, this also provides the opportunity to extend the library to other pro¬ 
gramming languages. RcppArmadillo (Eddelbuettel and Sanderson 2014) is applied for access 
to pre-optimized linear algebra routines, and BH (Eddelbuettel, Emerson, and Kane 2015) for 
access to the Boost libraries. We apply template meta-programming and reusable classes in 
an object-oriented framework, including concepts such as stochastic gradient methods, mod¬ 
els, and learning rates. Such concepts make it easy for other users to develop new algorithms 
and prototype them in their own research or practices. 

The plotting routines adopt many features from ggplot2 (Wickham 2009), and are effectively 
templated ggplot objects. Our software is also robust through unit testing which follows the 
paradigm from testthat (Wickham 2011). 


5. Discussion 

As explicit SGD has been used extensively in practice, particularly in the deep learning com¬ 
munity, many heuristics have been proposed to solve issues that often occur. We describe 
several of these issues and their proposed solutions in the literature, and compare to how our 
sgd package handles them. 

Overfitting. As SGD algorithms simply minimize a loss function evaluted over the training 
data, overfitting is a prevalent problem as it is for all estimation methods. This is particularly 
an issue in complex likelihood functions such as neural networks (see, e.g., Giles (2001); Bakker 
and Heskes (2003)). Even with penalization terms that try to offset the fit of the parameters, it 
is still difficult for explicit SGD to find the right set of hyperparameters for such regularization 
without a computationally intensive search. 

As a solution many practictioners adopt early stopping^ which simply halts the optimization 
routine before it converges. However, there is little theory on the estimates obtained from 
early stopping. Most practically, it is difficult to know when to stop the algorithm and how 
to use it in combination with other regularization techniques, such as penalization. 

Fortunately, one of the advantages of Ai-SGD is that it requires less such tweaking: the implicit 
update effectively performs a regularization as seen from the Bayesian perspective, c.f., Section 
1. We’ve also seen in practice that penalization terms do not affect the final estimates from 
AI-SGD, which makes it less reliant on heuristics, such as early stopping. 
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Vanishing or exploding gradients. The numerical instability of explicit SGD is a widespread 
issue in practice (Bengio, Simard, and Frasconi 1994; Hochreiter 1998; Hochreiter, Bengio, 
Frasconi, and Schmidhuber 2001; Toulis et al. 2014). The stochastic gradients can easily 
be too large leading to divergence, and when chained through compositions of functions can 
either vanish to zero, or even explode to numerically infinite values; for example, Toulis et al. 
(2014) demonstrate the instability of explicit SGD in a simple bivariate Poisson model, where 
slight misspecification of learning rate parameters lead to divergence. 

Pascanu, Mikolov, and Bengio (2012) propose gradient clipping, which simply thresholds the 
stochastic gradient if it is outside a bounded interval. Unfortunately, while it can work in 
practice, it is a heuristic that breaks the key assumptions for convergence rate guarantees on 
SGD algorithms. Similarly, there is no principled way to set the bounds. For Ai-SGD algorithms 
applied to the settings we consider in Section 2, such an issue never arises. Theoretical results 
establish stability regardless of the specification of the learning rate (Toulis and Airoldi 2015a, 
Section 3), and perform well in practice, as seen in Section 3. 


6. Concluding remarks 


The sgd package is the most extensive implementation in R of stochastic gradient methods 
for estimation with massive and/or streaming data sets. Thus, sgd broadens the capabilities 
of R for estimation with modern large data sets—on the orders of hundreds of millions of 
observations and hundreds of thousands of covariates—while retaining desirable statistical 
properties. The software is based on solid theory of stochastic approximations, which help 
guide the optimal selection of parameters, e.g., learning rates, in the underlying optimiza¬ 
tion routines. In this paper, we show how sgd can be applied for estimation of generalized 
linear models and M-estimation, which comprise a sizeable portion of estimation problems 
encountered in statistical practice. 

There are many software extensions that are currently in development. We are working 
to interface with other high-performance computing packages, namely sqldf for faster I/O 
applications with streaming data, doParallel (Analytics and Weston 2014) and Rmpi (Yu 
2002) for parallel updates across environments, and gputools (Buckner, Seligman, and Wilson 
2011) for efficient computing with GPUs. The algorithms described here directly appeal 
to asynchronous implementations, following Hogwild! (Nui, Recht, Re, and Wright 2011), 
which allows for lock-free allocation of CPU cores. Sparse data structures would allow for 
fast structured matrix and vector products, which occur, for example, when looping over 
the covariates of a data point, and would significantly speed up computation on sparse data 
sets. 

Finally, there has been little attention on, and in fact a pressing need for, model selection 
and hypothesis testing in SGD procedures. We are pursuing this in light of the new statistical 
challenges presented to us while developing the sgd package. 
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